[8.15] output: Retry document-level `429`s by default (backport #13620) #13621

mergify · 2024-07-08T08:57:09Z

Motivation/summary

Updates the APM Server to automatically retry document-level 429s from Elasticsearch to avoid dropping data. It can be configured/overwritten by output.elasticsearch.max_retries, and defaults to 3.

It uses the default backoff configuration, which could wait up to 1m if enough retries are configured, but can be overwritten as well.

This reduces data loss whenever Elasticsearch is overhelmed and reduces the chances of data loss due to ES push-back.

Checklist

Update CHANGELOG.asciidoc
~~- [ ] Documentation has been updated~~

How to test these changes

This may need a bit of setup. It's thoroughly tested in the go-docappender, but we may want to verify that retries happen when document-level 429s happen.

I'd suggest writing a small "proxy" that returns 429s randomly and test that things are indexed (as duplicates).

Related issues

Closes #13619

This is an automatic backport of pull request #13620 done by Mergify.

Updates the APM Server to automatically retry document-level `429`s from Elasticsearch to avoid dropping data. It can be configured/overwritten by `output.elasticsearch.max_retries`, and defaults to `3`. It uses the default backoff configuration, which could wait up to 1m if enough retries are configured, but can be overwritten as well. --------- Signed-off-by: Marc Lopez Rubio <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit 28436dc) # Conflicts: # changelogs/head.asciidoc

mergify · 2024-07-08T08:57:12Z

Cherry-pick of 28436dc has failed:

On branch mergify/bp/8.15/pr-13620
Your branch is up to date with 'origin/8.15'.

You are currently cherry-picking commit 28436dc03.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   internal/beater/beater.go
	modified:   internal/beater/beater_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   changelogs/head.asciidoc

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Signed-off-by: Marc Lopez Rubio <[email protected]>

inge4pres · 2024-07-09T21:38:47Z

Unfortunately I haven't had the chance to test this out today, nor to investigate further how the change could be validated.

One good thing was mentioned in our weekly by @marclop: the daily benchmarks would give us decent feedback, so I think one interesting way to see if the change affects overall performance would be to run the benchmarks on this PR's branch?

1pkg · 2024-07-09T23:13:38Z

internal/beater/beater.go

-	// Default to 1mib flushes, which is the default for go-docappender.
-	esConfig.FlushBytes = "1 mib"
-	esConfig.FlushInterval = time.Second
-	esConfig.Config = elasticsearch.DefaultConfig()
 	esConfig.MaxIdleConnsPerHost = 10


nit: why not moving MaxIdleConnsPerHost into structure initialization as other fields?

Because it's part of Elasticsearch.Config which is set already with elasticsearch.DefaultConfig().

inge4pres · 2024-07-10T10:43:14Z

I tried running the benchmarks against the smallest instance (1GB) in https://github.com/elastic/apm-server/actions/runs/9872740538

inge4pres · 2024-07-10T11:57:30Z

I tried running the benchmarks against the smallest instance

benchmarks are broken due to Kibana (Kyung Eun had mentioned that in the standup, and now I recalled it 😓 )

inge4pres · 2024-07-11T07:31:56Z

Here's a methodology that I think can get us some feedback.

Create an ESS deployment with the smallest hardware profile (1GB, 1 zone)
Run the current 8.15 BC (or latest SNAPSHOT) without the changes in this PR in the integrations server, and use apmbench to ingest a number of events that will force ES to return 429s, likely 5-10 k events/sec should be enough
Install the apmserver from this PR into the integrations server
Run the same load
compare the results: APM Server metricsfrom Cloud dashboards, ES responses and status metrics

The goal will be to detect if ES has higher load or if the number of total ingested events differs between the 2 version.
If no significant changes are detected (+/- 0.5 sigma) IMO we can consider the change good to be merged.

Thoughts?

inge4pres · 2024-07-11T17:37:37Z

Testing methodology:

deploy ES to 2GB, 1 zones / APM to 2 GB
run apmbench for roughly 30 minutes

./apmbench -rewrite-ids -rewrite-timestamps -rewrite-span-names -server=https://<id>.apm.us-west2.gcp.elastic-cloud.com:443 -event-rate=5000/s -api-key="<redacted>" -benchtime=4m

compare the results with the metrics in the ESS dashboards (APM and ES)

Results

8.15 BC 1

8.15 + this PR

Notes: in the ES screenshots for the 8.15 + this PR case there is double load being ingested into APM server.
Inexplicably, I was not able to replicate 429s with the same setup across 2 separate ESS deployments, so I aimed at increasing the load. The APM server screenshots are referring to the same load applied in case 1.

Conclusions: given the double load on ES, the peak of 429s that also doubled is somewhat expected.
Latency responses from ES are tightly correlated with its write thread pool queue: when write threads are queued we get higher latency and ES starts returning 429s.
There was not a significant increase in CPU or memory usage in ES.

APM server did get an increase in memory used, with the average going from ~350 MiB to ~500 MiB. I can't explain how this happens.
CPU usage also looks similar, with the exception of the presence of CPU throttling for one data point.

With the data presented I think there is not a significant overhead imposed to APM server or ES by this PR, so I am inclined to merge it.
Thoughts?

marclop · 2024-07-12T02:38:46Z

Looking at the apm-server RSS before vs after, I do not see cause for concern, particularly since the load was doubled in the after benchmark. @inge4pres correct me if I'm wrong, but that seems to be the case.

Before

After

inge4pres · 2024-07-12T14:45:14Z

I do not see cause for concern, particularly since the load was doubled in the after benchmark. @inge4pres correct me if I'm wrong, but that seems to be the case.

I agree 👍🏼
We did see an increase in total memory usage, but not in presence of 429s: the branch with the backport (after) has higher allocated memory overall, but when the 429s are present, thre is no significant visible effect on APM Server.

mergify bot requested a review from a team as a code owner July 8, 2024 08:57

mergify bot added backport conflicts There is a conflict in the backported pull request labels Jul 8, 2024

mergify bot assigned marclop Jul 8, 2024

Fix conflicts, update changelog

7c05d56

Signed-off-by: Marc Lopez Rubio <[email protected]>

marclop removed the conflicts There is a conflict in the backported pull request label Jul 9, 2024

1pkg reviewed Jul 9, 2024

View reviewed changes

marclop approved these changes Jul 14, 2024

View reviewed changes

Merge branch '8.15' into mergify/bp/8.15/pr-13620

eab1e6c

mergify bot merged commit 008b12a into 8.15 Jul 15, 2024
16 checks passed

mergify bot deleted the mergify/bp/8.15/pr-13620 branch July 15, 2024 00:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[8.15] output: Retry document-level `429`s by default (backport #13620) #13621

[8.15] output: Retry document-level `429`s by default (backport #13620) #13621

mergify bot commented Jul 8, 2024

mergify bot commented Jul 8, 2024

inge4pres commented Jul 9, 2024

1pkg Jul 9, 2024

marclop Jul 10, 2024

inge4pres commented Jul 10, 2024

inge4pres commented Jul 10, 2024

inge4pres commented Jul 11, 2024

inge4pres commented Jul 11, 2024

marclop commented Jul 12, 2024

inge4pres commented Jul 12, 2024

[8.15] output: Retry document-level 429s by default (backport #13620) #13621

[8.15] output: Retry document-level 429s by default (backport #13620) #13621

Conversation

mergify bot commented Jul 8, 2024

Motivation/summary

Checklist

How to test these changes

Related issues

mergify bot commented Jul 8, 2024

inge4pres commented Jul 9, 2024

1pkg Jul 9, 2024

Choose a reason for hiding this comment

marclop Jul 10, 2024

Choose a reason for hiding this comment

inge4pres commented Jul 10, 2024

inge4pres commented Jul 10, 2024

inge4pres commented Jul 11, 2024

inge4pres commented Jul 11, 2024

8.15 BC 1

8.15 + this PR

marclop commented Jul 12, 2024

Before

After

inge4pres commented Jul 12, 2024

[8.15] output: Retry document-level `429`s by default (backport #13620) #13621

[8.15] output: Retry document-level `429`s by default (backport #13620) #13621